 data science problem


DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation

Wang, He, Li, Alexander Hanbo, Hu, Yiqun, Zhang, Sheng, Kobayashi, Hideo, Zhang, Jiani, Zhu, Henry, Hang, Chung-Wei, Ng, Patrick

arXiv.org Artificial Intelligence

Large language model (LLM) agents have shown promising performance in generating code for solving complex data science problems. Recent studies primarily focus on enhancing in-context learning through improved search, sampling, and planning techniques, while overlooking the importance of the order in which problems are tackled during inference. In this work, we develop a novel inference-time optimization framework, referred to as DSMentor, which leverages curriculum learning -- a strategy that introduces simpler tasks first and progressively moves to more complex ones as the learner improves -- to enhance LLM agent performance in challenging data science tasks. Our mentor-guided framework organizes data science tasks in order of increasing difficulty and incorporates a growing long-term memory to retain prior experiences, guiding the agent's learning progression and enabling more effective utilization of accumulated knowledge. We evaluate DSMentor through extensive experiments on DSEval and QRData benchmarks. Experiments show that DSMentor using Claude-3.5-Sonnet improves the pass rate by up to 5.2% on DSEval and QRData compared to baseline agents. Furthermore, DSMentor demonstrates stronger causal reasoning ability, improving the pass rate by 8.8% on the causality problems compared to GPT-4 using Program-of-Thoughts prompts. Our work underscores the importance of developing effective strategies for accumulating and utilizing knowledge during inference, mirroring the human learning process and opening new avenues for improving LLM performance through curriculum-based inference optimization.
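
The two ideas named in this abstract, ordering tasks from easy to hard and keeping a growing memory of earlier attempts, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Task` and `Memory` classes, the numeric `difficulty` score, the most-recent-k retrieval rule, and `toy_solve` (a stand-in for an LLM call) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    difficulty: float  # hypothetical score; the abstract does not say how it is computed


@dataclass
class Memory:
    """Growing long-term memory of previously attempted tasks (illustrative only)."""
    entries: list = field(default_factory=list)

    def add(self, task, answer):
        self.entries.append((task.name, answer))

    def retrieve(self, k):
        # Placeholder retrieval rule: return the k most recent entries.
        return self.entries[-k:] if k > 0 else []


def run_curriculum(tasks, solve, k=2):
    """Attempt tasks from easiest to hardest, accumulating memory along the way."""
    memory = Memory()
    order = []
    for task in sorted(tasks, key=lambda t: t.difficulty):
        examples = memory.retrieve(k)    # knowledge gathered from easier tasks
        answer = solve(task, examples)   # in practice this would prompt an LLM agent
        memory.add(task, answer)
        order.append(task.name)
    return order, memory


def toy_solve(task, examples):
    # Stand-in for an LLM call; it only reports how many examples it was given.
    return f"{task.name} solved with {len(examples)} example(s)"


tasks = [Task("causal-effect", 0.9), Task("mean", 0.1), Task("regression", 0.5)]
order, memory = run_curriculum(tasks, toy_solve)
print(order)
```

In this toy run the easiest task is attempted with an empty memory, and each later task sees the results of the ones before it, which is the ordering effect the abstract argues for.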


Defining data science: a new field of inquiry

Brodie, Michael L

arXiv.org Artificial Intelligence

Data Systems Laboratory, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA. Draft, July 12, 2023. Data science is not a science. It is a research with which to define, unify, and evolve data paradigm. We benefits - the basis of a comprehensive solution have yet to understand and define it. Modern data science is in its infancy. Emerging 1. Challenges defining data science slowly since 1962 and rapidly since 2000, data 1.1. Due to its problem solving techniques is rare. Science and value, power, and scope of applicability, it is modern scientific analyses emerged 400 years ago emerging in over 40 disciplines, hundreds of and interpretivism and interpretivist analysis 200 research areas, and tens of thousands of years ago. While conventional data science is as applications. Yet we are just beginning to old as mathematics, AI-based data science is in its understand and define it. Tukey's 1962 vision of exploratory data publications contain myriad definitions of data analysis[20][21] brought renewed attention to science and data science problem solving. After its infancy, many definitions are independent, 2000, machine learning-based data science led to application-specific, mutually incomplete, a fundamentally new, inscrutable field of inquiry redundant, or inconsistent, hence so is data that we are just beginning to understand. This has led to calls and a data science journal[31] for the data science for a unifying framework to guide unification. An community to achieve such a definition. This paper provides candidate definitions for What is such a unifying framework? How do you essential data science artifacts that are required define a fundamentally new field of inquiry? For to discuss such a definition. They are based on the this we look to science, our currently most classical research paradigm concept[15] consisting powerful knowledge discovery paradigm.
of a philosophy of data science, the data science problem solving paradigm, and the six component 1.2. ACM lists 200+ data science journals. This required paradigms that were and Aristotle (384-322 BC)) then in terms of accepted by scientists to guide the unification of scientific models, theories, and the scientific the myriad definitions based on established method by Francis Bacon [Novum Organum 1620] results.


A data science axiology: the nature, value, and risks of data science

Brodie, Michael L.

arXiv.org Artificial Intelligence

Data Systems Laboratory, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA. Draft, July 18, 2023. Data science is not a science. It is a research a theory of value that defines the nature, value, paradigm. As data science is in its surpass science - our most powerful research infancy, its axiology can only be speculated. Such paradigm - in enabling knowledge discovery that an axiology can aid in understanding and defining is changing our world[10]. This paper explores and data science and recognizing potential benefits, evaluates its remarkable, definitive features. We present the history and nature of data science and offer Modern data science is in its infancy. Emerging candidate definitions of essential data science slowly since 1962 and rapidly since 2000, data concepts required to discuss its axiology. Within a science is a fundamentally new field of inquiry, decade, this remarkable new research paradigm one of the most active, powerful, and rapidly will be seen as a milestone in human knowledge evolving innovations of the 21st century. Yet we are just beginning to data science as a Promethean Moment[10] that understand and define it. Due to based on single inventions, e.g., the printing press, its infancy, many definitions are independent, this moment is based on a meta-technology Essential data science concepts data science community to achieve such a Data science (the data science research paradigm) definition. To problem solving based on its unique ability to contribute to an initial assessment and definition computationally analyze data to discover insights of data science, this paper proposes an initial into motivating domain problems where the axiology of data science. A comprehensive data science axiology is (i.e., learning from data) of data science research A meta technology is used to produce new technology and knowledge hence can be applicable to most human endeavors.
Data about, discover, articulate, and validate the true science results are probabilistic, correlational, nature of the ultimate questions about natural, possibly fragile or specific to the analysis method observable phenomena as new knowledge about or dataset, cannot be proven complete or correct, those phenomena. Scientific results are definitive, and lack explanations and interpretations for the conclusive, casual, robust, universal knowledge of motivating domain problem[46]. Like all research paradigms, science and discovery conducted by applying the data science data science are complementary.


Machine Learning For Data Science Using MATLAB

#artificialintelligence

MATLAB is a widely used programming language for numerical and statistical computing. This course is for you if you want to get a real feel for machine learning techniques without having to learn all the complicated maths. It is also for you if you have spent hours and hours on machine learning theory but never got a chance to figure out how to implement it and solve data science problems with it. The approach in this course is very practical, and we will start everything from scratch. We will start coding immediately after a couple of introductory tutorials, and we try to keep the theory to a bare minimum.


Unlocking the value of artificial intelligence and machine learning

#artificialintelligence

In an era of accelerated digitalisation, artificial intelligence (AI) and machine learning (ML) have fast become part of the IT infrastructure of many businesses. Consequently, the way these technologies are used to derive meaningful insights from vast quantities of data is maturing rapidly. "Early on, when organisations didn't have access to the computing power and zettabytes of data that they have today, AI was only springing up in pockets," says Vaidya JR, SVP and global head of data and AI at IT transformation specialist Hexaware Technologies. "The approach then was to see what AI could do for a company, without truly identifying a well-defined problem. Data science solutions were just a shot in the dark." "Organisations were struggling to put their data to effective use, which led to limited value generated and ineffectual business results," he adds. "You can crunch any amount of data, and create numerous models; it only adds value if there is a significant impact on the business."


Your Data Science Problems are Engineering Problems

#artificialintelligence

In 2011, Marc Andreessen famously wrote "Software is eating the world", and it's true. In 2022, data science and machine learning are eating software, and that matters to any modern digital business. If you work in fintech and want to use machine learning in your product, or you work at a startup and want to invest in an ML team, then read this post! And if you're an executive thinking about using DS & ML to drive impact in your business, this is especially for you.


A Layman's Guide to Data Science Workflow

#artificialintelligence

When you get involved in a data science project, you must take care of the basic elements first: the business objective, domain knowledge, your organization's standard data science practices, and previous experience. Only then should you consider the next steps toward a solution, such as data source identification, data modeling, data management, and data visualization. The data science industry already offers a variety of workflow frameworks for solving different kinds of data science problems, but it is not possible to develop an all-inclusive Data Science Workflow that solves every business problem. Instead, it is important to follow standard best practices, such as automating data pipelines, planning inferences, and doing a post-mortem at the end of every project to identify potential areas for improvement. In this article, you will learn about various standard data science workflows, the structure of a Data Science Workflow, and the considerations to keep in mind as you follow one.


Goal Setting in Data Science

#artificialintelligence

In the digital economy, data is the new gold for businesses; indeed, there's a new gold rush. To obtain value from gold, the raw material first needs to be processed: minted into coins or fashioned into jewelry and other products that consumers desire to own and purchase. Similarly, data needs to be processed (manipulated and analyzed) to extract real business value. And this is where data science comes in. Data scientists are the prospectors, and the tools they use are the innovations that make them more effective.


Decoding the Top 10 Data Science Jargons For Beginners (Commonly Asked In Interviews)

#artificialintelligence

This article is about decoding some of the popular jargon used in data science. It is important to understand these concepts better. They are commonly asked in data science job interviews. Let's get into the topics. A dependent variable (target variable) is driven by the independent variables in the study.
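
To make the dependent/independent distinction concrete, here is a small sketch (not taken from the article) that fits a straight line by ordinary least squares. The variable names and the study-hours data are invented for illustration: `hours` plays the role of the independent variable and `score` the dependent (target) variable.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Spread of the independent variable, and how x and y vary together.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: hours studied (independent) and exam score (dependent).
hours = [1.0, 2.0, 3.0, 4.0]
score = [52.0, 54.0, 56.0, 58.0]

slope, intercept = fit_line(hours, score)
print(f"score = {slope:.1f} * hours + {intercept:.1f}")
```

The fitted slope describes how much the dependent variable changes per unit of the independent variable in this toy dataset.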


A structured approach to solving data science problems!

#artificialintelligence

While working on numerous data science projects, I have seen that most data scientists adopt a haphazard approach when they work on a data science problem. While it is understandable that data science is both an art and a science, it is important to have some method to the madness. Most people I see just take the data and start throwing algorithms at it, hoping to achieve success through brute force. In most cases, this does not result in a favorable outcome. The business users also get frustrated, and data science then appears to be nothing more than a fad.